Supplementary Material: Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation (Section 1: Proofs)

Neural Information Processing Systems

The constraint $P_X, P_Y \in \mathrm{Prob}(\mathcal{X})$ states that $P_X$ and $P_Y$ are valid probability distributions. For brevity, we shall not state it explicitly in the rest of the proof. The above equation is similar in spirit to the Kantorovich-Rubinstein duality. An important observation is that the above optimization maximizes over only a single discriminator function (as opposed to two functions in optimization (2)). Hence, it is easier to train in large-scale deep learning problems such as GANs.
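For reference, a minimal statement of the classical Kantorovich-Rubinstein duality that the excerpt alludes to, written for the Wasserstein-1 distance (the paper's robust dual modifies this form and is not reproduced here):

\[
W_1(P_X, P_Y) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{X \sim P_X}\big[f(X)\big] \;-\; \mathbb{E}_{Y \sim P_Y}\big[f(Y)\big],
\]

where the supremum runs over 1-Lipschitz functions $f$. In WGAN-style training a single neural network plays the role of $f$, which is the sense in which a one-discriminator dual is easier to optimize at scale.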




Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

Kawata, Ryotaro, Suzuki, Taiji

arXiv.org Machine Learning

Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^{I} \mu^{(i)}$ and a query $x_{\mathrm{q}}$ tied to a component index $i^*$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
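A minimal numerical sketch (our illustration, not the paper's construction) of attention as a retrieval operator on a mixture context: tokens are pooled samples from $I$ component distributions, and a query near component $i^*$ concentrates the softmax weights on that component's tokens, so the attention output approximately recalls $\mu^{(i^*)}$'s mean.

```python
import numpy as np

def softmax_attention(query, keys, values, beta=5.0):
    """Kernel-smoothed retrieval: a discrete analogue of attention as an
    integral operator nu -> E_{k ~ nu}[softmax(beta <q, k>) v(k)]."""
    scores = beta * keys @ query                 # similarity of query to each token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the context tokens
    return weights @ values                      # attention output

rng = np.random.default_rng(0)
I, n, d = 4, 200, 8
centers = rng.normal(size=(I, d))                # means of the I mixture components
tokens = np.concatenate([c + 0.1 * rng.normal(size=(n, d)) for c in centers])

# Query near component i* = 2: attention mass concentrates on its tokens.
x_q = centers[2] + 0.1 * rng.normal(size=d)
out = softmax_attention(x_q, tokens, tokens)
print(np.linalg.norm(out - centers[2]))          # small: recall of mu^{(i*)}
```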


Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach

Guo, Xin, Lyu, Zijiu

arXiv.org Artificial Intelligence

This paper studies policy transfer, one of the well-known transfer learning techniques adopted in large language models, for two classes of continuous-time reinforcement learning problems. In the first class, continuous-time linear-quadratic systems with Shannon entropy regularization (a.k.a. LQRs), we fully exploit the Gaussian structure of the optimal policy and the stability of the associated Riccati equations. In the second class, where the system has possibly non-linear and bounded dynamics, the key technical component is the stability of diffusion SDEs, which is established by invoking rough path theory. Our work provides the first theoretical proof of policy transfer for continuous-time RL: an optimal policy learned for one RL problem can be used to initialize the search for a near-optimal policy in a closely related RL problem, while maintaining the convergence rate of the original algorithm. To illustrate the benefit of policy transfer for RL, we propose a novel policy learning algorithm for continuous-time LQRs, which achieves global linear convergence and local super-linear convergence. As a byproduct of our analysis, we derive the stability of a concrete class of continuous-time score-based diffusion models via their connection with LQRs.
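A minimal sketch of the warm-start idea in a discrete-time LQR surrogate of our own choosing (the paper works in continuous time with entropy regularization; `lqr_policy_iteration` and all parameter values below are illustrative assumptions): the gain learned on a source system initializes policy iteration on a nearby target system, which then needs only a few improvement steps.

```python
import numpy as np

def lqr_policy_iteration(A, B, Q, R, K0, iters=20):
    """Hewer-style policy iteration for discrete-time LQR, warm-started at K0.
    Each step: evaluate the current gain K, then improve it."""
    K = K0
    for _ in range(iters):
        Acl = A - B @ K
        # Policy evaluation: iterate the fixed point P = Q + K^T R K + Acl^T P Acl.
        P = Q + K.T @ R @ K
        for _ in range(300):
            P = Q + K.T @ R @ K + Acl.T @ P @ Acl
        # Policy improvement: exact minimization of the Q-function over u.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K

rng = np.random.default_rng(1)
n, m = 4, 2
A = 0.8 * np.eye(n) + 0.02 * rng.normal(size=(n, n))   # stable open loop
B = rng.normal(size=(n, m))
Q, R = np.eye(n), np.eye(m)

K_source = lqr_policy_iteration(A, B, Q, R, K0=np.zeros((m, n)))
# Transfer: initialize the perturbed ("target") problem from the source gain.
A_target = A + 0.02 * rng.normal(size=(n, n))
K_target = lqr_policy_iteration(A_target, B, Q, R, K0=K_source, iters=5)
```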




Asymptotic Guarantees for Generative Modeling Based on the Smooth Wasserstein Distance

Neural Information Processing Systems

Minimum distance estimation (MDE) gained recent attention as a formulation of (implicit) generative modeling. It considers minimizing, over model parameters, a statistical distance between the empirical data distribution and the model. This formulation lends itself well to theoretical analysis, but typical results are hindered by the curse of dimensionality.
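A minimal 1-D sketch under our own simplifications (the helper `smooth_w1` and the noise level are assumptions, not the paper's estimator): the smooth Wasserstein distance compares Gaussian-smoothed versions of two distributions, approximated here by adding noise to samples, and MDE selects the model parameter minimizing that distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def smooth_w1(x, y, sigma=0.5, rng=None):
    """W1 between Gaussian-smoothed empirical measures: convolve each sample
    set with N(0, sigma^2) noise, then compare the smoothed samples."""
    rng = rng or np.random.default_rng(0)
    return wasserstein_distance(x + sigma * rng.normal(size=x.shape),
                                y + sigma * rng.normal(size=y.shape))

rng = np.random.default_rng(42)
data = rng.normal(loc=1.5, size=2000)            # empirical data distribution

# MDE over a scalar location parameter theta of the model N(theta, 1).
thetas = np.linspace(0.0, 3.0, 61)
losses = [smooth_w1(data, theta + rng.normal(size=2000)) for theta in thetas]
print(thetas[int(np.argmin(losses))])            # close to the true location 1.5
```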


Metric spaces of walks and Lipschitz duality on graphs

Arnau, R., González Cortés, A., Sánchez Pérez, E. A., Sanjuan, S.

arXiv.org Artificial Intelligence

We propose an improvement to the exploration strategy used in reinforcement learning algorithms that incrementally construct walks within graph-based environments. Traditionally, these algorithms alternate between exploitation (choosing the next node to maximize an estimated reward) and exploration (randomly selecting a new node). The novelty lies in replacing random exploration with a proximity-guided strategy using a function $P$. Instead of sampling uniformly, the agent compares potential path extensions to a reference set of high-reward walks, prioritizing those that are most similar in structure. This approach introduces a more informed, data-driven method for exploration, focusing on areas of the graph that resemble previously successful trajectories. The suggested procedure involves integrating the proximity function $P$ as a mechanism to guide exploration on the space of walks. While our use of $P$ has so far focused on classification and metric analysis, its geometric interpretation and ability to quantify similarity between walks suggest a broader applicability, particularly in settings where the reward landscape is sparse or the graph structure is too large for exhaustive exploration.
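A minimal sketch of proximity-guided exploration, with a hypothetical similarity `prox` standing in for the paper's proximity function $P$ (the graph, the reference walks, and the scoring rule below are all illustrative assumptions): candidate extensions are scored against a reference set of high-reward walks, and exploration picks the most similar extension instead of a uniform random node.

```python
import random

def prox(walk, reference_walks):
    """Hypothetical stand-in for the paper's proximity function: fraction of
    the walk's edges that appear in the best-matching reference walk."""
    edges = set(zip(walk, walk[1:]))
    if not edges:
        return 0.0
    return max(len(edges & set(zip(r, r[1:]))) / len(edges)
               for r in reference_walks)

def explore_step(graph, walk, reference_walks):
    """Replace uniform exploration: extend the walk toward the neighbor whose
    resulting walk is most similar to previously successful trajectories."""
    candidates = graph[walk[-1]]
    scored = [(prox(walk + [v], reference_walks), v) for v in candidates]
    best_score, best_node = max(scored)
    if best_score == 0.0:                        # no guidance: fall back to random
        return random.choice(candidates)
    return best_node

graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: [0, 1]}
high_reward_walks = [[0, 1, 3], [0, 1, 2, 3]]
print(explore_step(graph, [0], high_reward_walks))   # prefers node 1
```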